TL;DR

What will I learn?

This tutorial is a primer on going modular and using Docker 🐳 containers for your bioinformatics analysis tasks. More specifically, we will learn how to use the Deploit platform to assemble and deploy to the cloud a reproducible and sharable bioinformatics workflow.

We will assemble the following resources into a workflow on the Deploit platform:

  • 🐳 a Docker container, from DockerHub
  • 🚀 a curated pipeline, available on the Deploit platform
  • 📝 a 1000genomes project fastq.gz file, fetched from an EMBL-EBI FTP site

Last things first: What results will I get from the FastQC tool on Deploit?

After completing this mini workflow on the Deploit platform, you will have:
1) a plot-ful FastQC html report,
with key metrics to assess the quality of your FASTQ file

2) sharable links to your Job Pages,
with interactive plots and information about the resources and the results.

You can access the Job Pages from the sharable URLs we created for this example:

  • for retrieving the 1000genomes fastq.gz file : 🔗 wgetGunzipper
  • for generating the FastQC html report: 🔗 fastQsee

and take a look at what a Job Page looks like below:

For every job run on Deploit, a Job Page report is created.

Why not install everything I need and run FastQC on my machine instead?

In principle, you could easily install the dependencies for running FastQC on your own machine. But this tutorial is more about learning how to use Deploit to easily combine resources (code from GitHub, containers from Docker Hub) to assemble multi-step bioinformatics workflows. This fastQsee pipeline will serve as our dummy example to go through the steps. Deploit enables you to structure your workflows as an assembly of individual, self-contained units of computation (jobs), by bringing all the resources required to run an analysis (data, code, OS & tools, computational resources) into one place. Each step in a bioinformatics workflow will most likely utilize different tools and have different dependencies. But why not install all the tools on one machine and run everything there, right? Well, for starters: dependencies!



We all know that it’s a hassle to make all the tools play nice together. There are several reasons why bioinformaticians have started joining developers and data scientists in slowly abandoning the monolithic, all-in-one-place analysis environment and shifting towards more robust, modular and portable solutions. While virtualization and containers have been around for quite some time, Docker has really revolutionized the way we work over the past few years. In a bioinformatics workflow, each process can utilize a different container as an execution microenvironment, with a main focus on preventing dependency conflicts and ensuring reproducibility.

This modularity also unlocks many cool features as a positive side effect:

1) Portability: Installation-free (and hassle-free!) runs on any other machine
2) Cloudability: Easily deploy to the cloud
3) Reproducibility: Allows someone else to run the same pipeline
4) Frictionless Retouching: Allows for easily removing, adding or retouching individual processes without affecting the rest
5) Isolation of Dependencies: Conflicting dependencies are isolated
6) Same tool, different tool version: Ability to use the same tool, but a different version of it, in different processes if needed (legacy code in bioinformatics tools, anyone?)
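To make this modularity concrete, here is a hypothetical sketch of how the two steps of this tutorial's workflow could also run locally with plain Docker, each in its own container. It assumes you have Docker installed, that wget and gunzip are available inside the frolvlad/alpine-bash image, and that the lifebitai/fastqc image exposes a fastqc executable on its PATH — assumptions you never need to verify on Deploit, since the platform wires this up for you:

```shell
# Step 1: fetch & decompress inside the minimal Alpine+bash container.
# The container is the OS: nothing is installed on the host machine.
docker run --rm -v "$PWD":/work -w /work frolvlad/alpine-bash \
  bash -c 'wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq'

# Step 2: run FastQC in a *different* container; its dependencies
# (Java, FastQC itself) never touch step 1's environment.
docker run --rm -v "$PWD":/work -w /work lifebitai/fastqc \
  fastqc SRR062634.fastq
```

Note how each step mounts the same working directory but brings its own isolated toolchain — that is the per-process microenvironment idea in two commands.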

Overview

Why do I need a FastQC report?

The analysis journey of NGS-generated FASTQ files should always start, as with any other data analysis task, with a robust bout of Exploratory Data Analysis (EDA).



The FastQC tool (Andrews S. et al., 2010) facilitates this task by providing a plot-ful html report with key metrics for read quality.

FastQC quality assessment plots


If you are not familiar with how to interpret the plots, you can start by taking a look at the good and bad quality sequencing data examples available on the official FastQC webpage. There is also a really great presentation on “RNA-seq quality control and pre-processing” by Mikael Huss that you can check out here:
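The numbers behind those plots are Phred quality scores, decoded character-by-character from the fourth line of each FASTQ record. Here is a minimal sketch of that decoding (the record and quality characters below are made up for illustration — Phred+33 encoding is assumed, as used by modern Illumina data):

```shell
# One FASTQ record: header, sequence, '+' separator, quality string
printf '@demo_read\nACGT\n+\n!I=5\n' > demo.fastq

# Phred+33 decoding: score = ASCII(char) - 33
for c in '!' 'I' '=' '5'; do
  ascii=$(printf '%d' "'$c")      # POSIX printf trick: 'X -> ASCII code of X
  echo "$c -> Q$((ascii - 33))"
done
# '!' -> Q0 (worst), 'I' -> Q40 (a 1-in-10,000 base-call error probability)
```

A higher Q means a lower error probability (P = 10^(-Q/10)), which is exactly the scale on the y-axis of FastQC's per-base quality plot.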

Step-By-Step Tutorial: Deploit in action (aka in gifs!)

Now that we have an overview of how Deploit brings your resources in one place, and we have found what resources we will need, time to go back to the fastQsee tutorial to generate the FastQC report.

All the resources that we will need can be summarized in the following table:

What Where
DATA SRR062634.fastq.gz (1000genomes example file)
CODE wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq
OS/TOOLS frolvlad/alpine-bash & lifebitai/fastqc Docker containers
RESOURCES Lifebit Cloud (provided with registration)

Let’s head over to the Deploit platform to generate the FastQC report step-by-step.

STEP 0: Create a Project for your analysis tasks

For generating the FastQC report, we will deploy two jobs:
1) One for retrieving the file
2) One for running the FastQC tool.

It is advisable to create a Project to host the individual tasks/jobs of a workflow. Think of the Project entity in the Deploit platform as your parent directory for the project. There, you will have access not only to the data and code, but also to all the jobs that have been run. You will be able to revisit, clone and deploy the same jobs again very easily.


Log in to your Deploit account and access the Projects section from the light bulb icon 💡 on the left of your screen. Click on the green New button on the right, and provide a Name and Description for your new Project.

We set up the Project for this example by filling in:

  • Name: “fastQsee”
  • Description: “Quickly generate a FastQC report”

You can find an overview of how these steps look on Deploit below:

STEP 1: Port the frolvlad/alpine-bash Docker container to Deploit

After you log in to Deploit, find the Pipelines section from the navigation bar on the left of your screen. We will create a new pipeline in the Pipelines > My Pipelines & Tools section.


This is how we will port the frolvlad/alpine-bash Docker container to Deploit, so that we can use it for retrieving the fastq.gz file. Have the link to the Docker repository ready for copy+pasting:
URL to Docker Hub: hub.docker.com/r/frolvlad/alpine-bash
Then go ahead and click the green New button on the right.

As shown above, you will be prompted to select “Where are you porting your pipeline from?”. Click on the Docker whale and then click on Select to proceed. Continue by filling in the required fields to port the container. You can have a look at how we set this up below as an example:

The Default command field is handy for saving time, but for now we will leave it blank so that we can use the command field as a terminal.

You can see an overview of the process described above in the following gif:

STEP 2: Use the newly created pipeline to retrieve and uncompress the fastq.gz file on Deploit

Time to use our newly created pipeline, and utilize the Docker container that we ported. We will use the command field on Deploit as a terminal, and the Docker container as our operating system to download and decompress the fastq.gz file we selected for this example. We will then deploy the job and come back to find our FASTQ file in the Data > Job Results section on the Deploit platform.

We will run the following command as mentioned earlier:

wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq

Have it ready for copy+pasting, and let’s go back to the Deploit platform. Access the newly created pipeline that utilizes the Alpine Linux docker container (we named ours wgetGunzipper) by clicking:

Pipelines > My Pipelines & Tools > wgetGunzipper

Paste the command from above in the Executable field. Take a look at how this will look on the Deploit platform below:

You can see an overview of the final command in the bottom of the screen:


Now we are ready to deploy the first of the two jobs needed to generate the FastQC report.

Click on Next on the top right of your screen. You will be redirected to the page where you will:

  1. Select the Project your job belongs to (we selected fastQsee)
  2. Select an instance for deploying your job

When you have selected both, you are ready to submit your job. Go ahead and click Run job. You will be redirected to the Jobs page. Your job will be scheduled, initialized and completed shortly. Take a look below to see how these steps should look on the Deploit platform.

After job completion, you can access the decompressed FASTQ file in the Data > Job Results section. We expect to find our FASTQ file in the fastQsee Project folder, with the filename we defined when retrieving it: SRR062634.fastq.

Reminder: this is the command we submitted to download, decompress and rename the fastq.gz file:

wget -O - https://lifebit.page.link/ftp_SRR062634_fastq_gz | gunzip -c > SRR062634.fastq
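If you would like to convince yourself of what the `gunzip -c > file` part of that pipeline does, here is a tiny self-contained rehearsal of the decompression step — it uses a made-up one-record FASTQ in place of the real download, so no network access is needed:

```shell
# Stand-in for the downloaded fastq.gz: compress one made-up FASTQ record
printf '@demo_read\nACGT\n+\nIIII\n' | gzip > demo.fastq.gz

# Same `gunzip -c > file` pattern as the pipeline above:
# decompress to stdout, redirect into the (renamed) output file
gunzip -c demo.fastq.gz > demo.fastq

wc -l < demo.fastq    # prints 4: one FASTQ record is exactly 4 lines
```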

STEP 3: Use the uncompressed FASTQ file as input to lifebitai/fastqc

Our input file for the next and last step, the uncompressed FASTQ file, is now available on the Deploit platform in the Data > Job Results section. We will use this file as input for the lifebitai/fastqc pipeline, which is essentially a dockerized version of the FastQC tool and its dependencies. In the Pipelines > PUBLIC PIPELINES & TOOLS section, start typing "fastqc" in the search bar to easily find the lifebitai/fastqc pipeline.



Time to set up the pipeline. Let’s:

  1. define input data
  2. select Project that the pipeline will be associated with
  3. select instance (cloud resources)

As shown above, we have selected as input data the 1000genomes FASTQ file we fetched from the EMBL-EBI FTP server in the previous step. No other parameters are required to run the fastqc pipeline. We then selected an instance (anything larger than 1 CPU will work).

STEP 3 - Quick tip: Easily run available pipelines on Deploit

All curated pipelines included in Deploit’s library, in the Pipelines > PUBLIC PIPELINES & TOOLS section, come with example parameters and data. Click Try with example data & parameters to:

1. Figure out the required flags and arguments to run the pipeline

from the Deploit fields:

or

from the final command field:

This way you can see how the command should look when you customize the pipeline with your own parameters and input data. For example, as shown above we can see that the lifebitai/fastqc pipeline can be run just by typing:

fastq name_of_my_fastq_file.fq

Notice that a .fq file is required and no other parameters are needed.

2. Take a look at output files of new pipelines

You can run with example parameters & data just to check the output files that the pipeline generates, and explore new pipelines and tools you haven’t used before for your own omics data. You might discover another way you can interrogate your omics data and generate more results to inspect.

STEP 4: Job completed! Time to inspect the FastQC html report

Once the job has been completed, you can access your results from the Job Page, as shown below:

The Job Page serves as a summary report that includes information about:

  • Pipeline and configuration: Name of the pipeline and configuration selected for this job
  • Job status: Job progress (% complete) or if it has failed
  • Resource Monitor: CPU usage, RAM requirements
  • Table overview: Runtime, number of processes, total cost

It also works as a portal to access all the generated output files in the Results section. Every job is assigned a unique Job ID, so that you can refer back to it or programmatically retrieve information about the job, using your private key, through Deploit’s RESTful API.

STEP 5: Share, reproduce, re-use

Share

You can also share your Job Page, and hence your analysis results, by creating a public sharable link from the Job Page as shown below:




Anyone who has your public link can access your Job Page. Want to stop sharing? No worries! Just make the page private again and the URL will lead to a cute tardigrade 404 page.
Take a look below:


Reproduce

Immutable Docker digests

In the Job page you can easily access the Docker image digest.

Digests are unique, immutable Docker container identifiers, and we chose to use them to ensure that the image used when cloning a job is always the same. Tags are mutable and don’t guarantee reproducibility. You can read more about this on the official Docker Docs page here.
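To illustrate the difference: a tag and a digest are two ways to address an image, but only the digest is derived from the image content itself. Here is a hedged sketch — the docker pull lines are shown as comments (the digest value is elided; copy the real one from your Job Page), while the runnable part demonstrates why a content hash can never drift:

```shell
# Two ways to pull the same image (digest elided; find yours on the Job Page):
#   docker pull frolvlad/alpine-bash:latest        <- tag: a mutable pointer
#   docker pull frolvlad/alpine-bash@sha256:...    <- digest: pins exact content
#
# Why a digest is immutable: it is a sha256 hash of the image content.
# Identical bytes always hash to the identical digest:
h1=$(printf 'image-layer-bytes' | sha256sum)
h2=$(printf 'image-layer-bytes' | sha256sum)
[ "$h1" = "$h2" ] && echo "same content, same digest"
```

If anyone pushed a different image under the same tag, the tag would silently point elsewhere — but a pull by digest would either fetch the exact original bytes or fail.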



GitHub commits

If you port a pipeline from GitHub, you also have access to the GitHub commit, so that you know exactly which revision you used for running your pipeline. Straight from the GitHub glossary:

You can access it from the Job Page > Pipeline and configuration > Revision as shown below:

Re-use

If you have run a job on the Deploit platform, you can easily use the clone feature to revisit your pipeline. By clicking on clone, you will get a different Job ID but an automatically identical configuration set-up. You will be redirected to the job submission page.

Take a look below:

After cloning, you can modify the pipeline configuration as little (e.g. change the input file and keep all other parameters the same) or as much (e.g. change every parameter value) as you like. This helps you quickly modify and re-use pipelines.

Acknowledgements

Many thanks to Phil Palmer and Diogo Silva for reproducibility and Docker feedback.

sessionInfo

sessioninfo::package_info()

##  package     * version    date       lib source                         
##  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.2)                 
##  base64enc     0.1-3      2015-07-28 [1] RSPM (R 3.5.2)                 
##  cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.2)                 
##  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.2)                 
##  digest        0.6.18     2018-10-10 [1] RSPM (R 3.5.2)                 
##  emo           0.0.0.9000 2019-01-29 [1] Github (hadley/emo@02a5206)    
##  evaluate      0.12       2018-10-09 [1] RSPM (R 3.5.2)                 
##  glue          1.3.0.9000 2019-01-29 [1] Github (tidyverse/glue@8188cea)
##  htmltools     0.3.6      2017-04-28 [1] RSPM (R 3.5.2)                 
##  htmlwidgets   1.3        2018-09-30 [1] CRAN (R 3.5.2)                 
##  jsonlite      1.6        2018-12-07 [1] RSPM (R 3.5.2)                 
##  knitr         1.21       2018-12-10 [1] RSPM (R 3.5.2)                 
##  lubridate     1.7.4      2018-04-11 [1] RSPM (R 3.5.2)                 
##  magrittr      1.5        2014-11-22 [1] RSPM (R 3.5.2)                 
##  purrr         0.3.0      2019-01-27 [1] CRAN (R 3.5.2)                 
##  Rcpp          1.0.0      2018-11-07 [1] RSPM (R 3.5.2)                 
##  rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)                 
##  rmarkdown     1.11       2018-12-08 [1] RSPM (R 3.5.2)                 
##  rstudioapi    0.9.0      2019-01-09 [1] CRAN (R 3.5.2)                 
##  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.2)                 
##  slickR        0.2.4      2018-03-06 [1] CRAN (R 3.5.2)                 
##  stringi       1.2.4      2018-07-20 [1] RSPM (R 3.5.2)                 
##  stringr       1.3.1      2018-05-10 [1] RSPM (R 3.5.2)                 
##  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.2)                 
##  xfun          0.4        2018-10-23 [1] RSPM (R 3.5.2)                 
##  xml2          1.2.0      2018-01-24 [1] CRAN (R 3.5.2)                 
##  yaml          2.2.0      2018-07-25 [1] RSPM (R 3.5.2)                 
## 
## [1] /home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.5
## [2] /opt/R/3.5.2/lib/R/library

sessioninfo::platform_info()

## Warning in system("timedatectl", intern = TRUE): running command
## 'timedatectl' had status 1
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Ubuntu 16.04.5 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  C.UTF-8                     
##  ctype    C.UTF-8                     
##  tz       Etc/UTC                     
##  date     2019-02-03

sessioninfo::os_name()

## [1] "Ubuntu 16.04.5 LTS"

The End

## Warning in value[[3L]](cond): beep() could not play the sound due to the following error:
## Error in play.default(x, rate, ...): no audio drivers are available
